The dataset, which relates to white variants of the Portuguese “Vinho Verde” wine, consists of several physicochemical test values per sample and one output sensory variable. The output value is the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
Variables info:
Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines.
Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 g/L, and wines with more than 45 g/L are considered sweet.
Chlorides: the amount of salt in the wine.
Free sulphur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulphite ion; it prevents microbial growth and the oxidation of wine.
Total sulphur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Sulphates: a wine additive which can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
Alcohol: the percent alcohol content of the wine.
Quality: output variable based on sensory data with values between 0 and 10.
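Because pH is a logarithmic scale, small differences in pH correspond to large differences in acidity. The following is a minimal sketch of that relation; the helper name `ph_to_h` is hypothetical, and the two pH values are the minimum and maximum reported in the statistical summary of this dataset.

```r
# pH is the negative base-10 logarithm of the hydrogen-ion activity,
# so one pH unit corresponds to a tenfold change.
# `ph_to_h` is a hypothetical helper name used only for this illustration.
ph_to_h <- function(ph) 10^(-ph)

ph_low  <- 2.72  # lowest pH in the summary table
ph_high <- 3.82  # highest pH in the summary table

# How many times more hydrogen ions the most acidic wine has
# compared with the least acidic one.
acidity_ratio <- ph_to_h(ph_low) / ph_to_h(ph_high)
print(acidity_ratio)
```

A difference of 1.1 pH units therefore amounts to roughly a twelvefold difference in hydrogen-ion activity.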
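# Load Data
library(formattable)  # formattable(), area(), normalize_bar(), as.datatable()
library(dplyr)        # %>%, group_by(), summarise()
# Later code also uses: ggplot2, gridExtra, pander, corrplot, ggalt,
# memisc, rpart, rpart.plot and plotly.
wines <- read.csv("wineQualityWhites.csv")
# View data
formattable(wines, list(area(col = c(quality))
~ normalize_bar("yellow", 0.2))) %>% as.datatable()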
Take a look at the data structure and get a statistical summary.
# Create a new bound form of SO2 variable.
wines$bound.sulfur.dioxide <- wines$total.sulfur.dioxide -
wines$free.sulfur.dioxide
# Create total acidity.
wines$total.acidity <- wines$fixed.acidity +
wines$volatile.acidity
# Create a rating variable from quality.
wines$rating <- NA
wines$rating <- ifelse(wines$quality < 5, "Undrinkable",
ifelse(wines$quality < 6, "Drinkable",
ifelse(wines$quality < 7, "Average",
ifelse(wines$quality < 8, "Good", "Great"))))
wines$rating <- factor(wines$rating)
wines$rating <- ordered(wines$rating, levels = c("Undrinkable", "Drinkable",
"Average", "Good", "Great"))
# str(wines) # uncomment to see the structure of wines dataset.
pandoc.table(summary(wines))
| X | fixed.acidity | volatile.acidity | citric.acid | residual.sugar |
|---|---|---|---|---|
| Min. : 1 | Min. : 3.800 | Min. :0.0800 | Min. :0.0000 | Min. : 0.600 |
| 1st Qu.:1225 | 1st Qu.: 6.300 | 1st Qu.:0.2100 | 1st Qu.:0.2700 | 1st Qu.: 1.700 |
| Median :2450 | Median : 6.800 | Median :0.2600 | Median :0.3200 | Median : 5.200 |
| Mean :2450 | Mean : 6.855 | Mean :0.2782 | Mean :0.3342 | Mean : 6.391 |
| 3rd Qu.:3674 | 3rd Qu.: 7.300 | 3rd Qu.:0.3200 | 3rd Qu.:0.3900 | 3rd Qu.: 9.900 |
| Max. :4898 | Max. :14.200 | Max. :1.1000 | Max. :1.6600 | Max. :65.800 |
| chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density |
|---|---|---|---|
| Min. :0.00900 | Min. : 2.00 | Min. : 9.0 | Min. :0.9871 |
| 1st Qu.:0.03600 | 1st Qu.: 23.00 | 1st Qu.:108.0 | 1st Qu.:0.9917 |
| Median :0.04300 | Median : 34.00 | Median :134.0 | Median :0.9937 |
| Mean :0.04577 | Mean : 35.31 | Mean :138.4 | Mean :0.9940 |
| 3rd Qu.:0.05000 | 3rd Qu.: 46.00 | 3rd Qu.:167.0 | 3rd Qu.:0.9961 |
| Max. :0.34600 | Max. :289.00 | Max. :440.0 | Max. :1.0390 |
| pH | sulphates | alcohol | quality | bound.sulfur.dioxide |
|---|---|---|---|---|
| Min. :2.720 | Min. :0.2200 | Min. : 8.00 | Min. :3.000 | Min. : 4.0 |
| 1st Qu.:3.090 | 1st Qu.:0.4100 | 1st Qu.: 9.50 | 1st Qu.:5.000 | 1st Qu.: 78.0 |
| Median :3.180 | Median :0.4700 | Median :10.40 | Median :6.000 | Median :100.0 |
| Mean :3.188 | Mean :0.4898 | Mean :10.51 | Mean :5.878 | Mean :103.1 |
| 3rd Qu.:3.280 | 3rd Qu.:0.5500 | 3rd Qu.:11.40 | 3rd Qu.:6.000 | 3rd Qu.:125.0 |
| Max. :3.820 | Max. :1.0800 | Max. :14.20 | Max. :9.000 | Max. :331.0 |
| total.acidity | rating |
|---|---|
| Min. : 4.110 | Undrinkable: 183 |
| 1st Qu.: 6.570 | Drinkable :1457 |
| Median : 7.070 | Average :2198 |
| Mean : 7.133 | Good : 880 |
| 3rd Qu.: 7.590 | Great : 180 |
| Max. :14.470 | NA |
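The rating variable above is built with nested `ifelse()` calls followed by two factor conversions. As an alternative sketch (not the code used for the results in this report), base R's `cut()` can produce the same ordered factor in one step, assuming quality takes integer values as it does here.

```r
# Quality scores that actually occur in the data (3 to 9).
quality <- 3:9

# Intervals are right-closed by default: (-Inf,4], (4,5], (5,6], (6,7], (7,Inf].
# For integer scores this matches the nested ifelse() thresholds (<5, <6, <7, <8).
rating <- cut(quality,
              breaks = c(-Inf, 4, 5, 6, 7, Inf),
              labels = c("Undrinkable", "Drinkable", "Average", "Good", "Great"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

The `ordered_result = TRUE` argument keeps the levels ordered, so comparisons such as `rating >= "Good"` work.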
The white wines dataset includes 12 variables with almost 5000 observations. Residual sugar concentration is low for most of the wines: the 3rd quartile is 9.9 g/L, although the maximum is about 66 g/L. Since wines are considered sweet only above 45 g/L, there are very few sweet white wines in the list.
For most wines free sulphur dioxide is below 50 ppm (the 3rd quartile is 46). pH ranges from 2.72 to 3.82 and alcohol content from 8 to 14.2. Quality ranges only from 3 to 9 (no 0, 1, 2 or 10 values) with a median of 6.
The following histograms show the distribution of the data. Boxplots help to depict the outliers, although all of the data were used in the following analysis and no outliers were removed. Since the data were derived from chemical tests, it is assumed here that at least 2 measurements were conducted for each observation in order to handle experimental errors. An outlier may be due to variability in the measurement or it may indicate experimental error; only the latter is sometimes excluded from a dataset.
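Boxplots conventionally flag points lying more than 1.5 times the IQR beyond the quartiles. The sketch below shows that rule on a small made-up vector and then applies it to the residual sugar quartiles from the summary table; the helper `iqr_fences` is a hypothetical name, not part of the analysis code.

```r
# Lower and upper fences of the 1.5 * IQR rule.
# `iqr_fences` is a hypothetical helper for illustration only.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up values, not taken from the wine data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
fences <- iqr_fences(x)
flagged <- x[x < fences["lower"] | x > fences["upper"]]
print(fences)
print(flagged)

# Same rule applied to the residual sugar quartiles in the summary table
# (1st Qu. = 1.7, 3rd Qu. = 9.9).
sugar_upper <- 9.9 + 1.5 * (9.9 - 1.7)
print(sugar_upper)
```

With an upper fence of about 22.2 g/L, the maximum residual sugar of 65.8 g/L lies far outside the whisker range.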
# Create a function for plotting: a jittered boxplot next to a histogram.
f_plot1 <- function(x, bins) {
  grid.arrange(ggplot(data = wines, aes_string(x = 1, y = x)) +
                 geom_jitter(alpha = 0.05) +
                 geom_boxplot(alpha = 0.2, color = "blue"),
               ggplot(data = wines, aes_string(x = x)) +
                 geom_histogram(bins = bins),
               ncol = 2)
}
f_plot1("total.acidity", 40)
The distribution of total acidity is approximately normal, with a minimum of 4.11 and a long tail above 9. Half of the values lie between 6.57 and 7.59 (the IQR).
f_plot1("citric.acid", 35)
Citric acid has some 0 values and a few outliers above 0.6. The interquartile range (IQR) is from 0.27 to 0.39.
f_plot1("residual.sugar", 50)
Residual sugar is positively skewed. Transforming the x axis to a log10 scale and tuning the number of bins may reveal more structure.
ggplot(data = wines, aes( x = residual.sugar)) +
geom_histogram(bins = 40) +
scale_x_log10()
Indeed, residual sugar distribution is bimodal with peaks at about 2.5 and around 10.
f_plot1("chlorides", 35)
Some tweaking is necessary to help visualise the histogram of chlorides.
ggplot(data = wines, aes(x = chlorides)) +
geom_histogram(bins = 35) +
# The argument is `limits`; values outside 0-0.1 are dropped from the plot.
scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, 0.1, 0.01))
Within this range, chlorides look approximately normal, with 50% of values between 0.036 and 0.05.
f_plot1("total.sulfur.dioxide", 40)
Total sulphur dioxide spans a wide range, from 9 to 440. Transforming the x axis to a log10 scale may show more information.
ggplot(data = wines, aes( x = total.sulfur.dioxide)) +
geom_histogram(bins = 60) +
scale_x_log10()
After the log10 conversion, the distribution looks negatively skewed.
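The skewness statements in this section are read off the histograms. A numeric check is possible with the moment-based skewness coefficient; the sketch below defines it by hand (the function name `skewness` is an assumption here, not a base R function) and uses made-up values to show how a log10 transform reduces positive skew.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment. Positive values mean a right tail.
# Defined here by hand; base R has no skewness() function.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up right-skewed values, not taken from the wine data.
x <- c(1, 1, 2, 2, 3, 4, 6, 10, 30)

print(skewness(x))         # clearly positive
print(skewness(log10(x)))  # smaller after the log10 transform
```

On the real data the same function could be applied to columns such as `wines$residual.sugar` or `log10(wines$total.sulfur.dioxide)`.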
f_plot1("density", 30)
Density has a small range with an IQR from 0.9917 to 0.9961. Only very few wines have a density higher than 1 with a maximum at 1.0390.
f_plot1("pH", 55)
The distribution of pH is normal with values from 2.720 to 3.820 and a median at 3.18, which shows that wines are acidic.
f_plot1("sulphates", 22)
Sulphates’ distribution is positively skewed with a median at 0.47.
f_plot1("alcohol", 21)
Alcohol’s histogram seems to be positively skewed with values from 8 to 14.20.
p1 <- ggplot(data = wines, aes(x = rating)) +
geom_bar()
p2 <- ggplot(data = wines, aes(x = quality)) +
geom_density(fill = "lightblue", alpha = 0.5) +
scale_x_continuous(breaks = seq(3, 9, 1))
grid.arrange(p1, p2, ncol = 1)
Wines with high residual sugar concentrations probably have a higher density, since sugars contribute to density. Checking sweetness,
by(wines$density, wines$residual.sugar > 45, summary)
## wines$residual.sugar > 45: FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0103
## --------------------------------------------------------
## wines$residual.sugar > 45: TRUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.039 1.039 1.039 1.039 1.039 1.039
table(wines$residual.sugar > 45)
##
## FALSE TRUE
## 4897 1
it is found that there is only one sweet wine, which also has the highest observed density in the list.
Residual sugar is the only variable which follows a bimodal distribution. This suggests that there are two groups of wines: dry and off-dry.
Although quality is considered the most important feature, the relationships between all of the variables will be further studied.
It is known that pH, acidity, sugar content, SO2 concentrations and salinity (chlorides) play an important role in the taste, flavours, aromas, structure, colour and ageability of wines.
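Three new variables were created: bound.sulfur.dioxide (total minus free SO2), total.acidity (fixed plus volatile acidity) and rating (an ordered category derived from quality).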
Explore the correlations between the variables.
wines_corr_r <- round(cor(wines[c(2:13)]), 1)
par(xpd=TRUE)
corrplot(wines_corr_r, method = "number", mar = c(2, 0, 1, 0))
Several interesting correlations were found:
Examine the above relationships further by visualizing them with scatterplots and encircling the wines with the top quality score of 9.
p1 <- ggplot(data = wines, aes(x = alcohol, y = rating)) +
geom_jitter(alpha = 0.1)
p2 <- ggplot(data = wines, aes(x = round(alcohol / 0.5) * 0.5, y = quality)) +
# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
geom_line(stat = "summary", fun = mean)
p3 <- ggplot(data = wines, aes(x = residual.sugar, y = density)) +
geom_jitter(alpha = 0.05) +
xlim(0, 20) +
ylim(0.98, 1.01) +
geom_smooth(method = "lm", color = "blue") +
geom_encircle(aes(x = residual.sugar, y = density),
data = wines[wines$quality >= 9, ],
color = "orangered")
p4 <- ggplot(data = wines, aes(x = alcohol, y = density)) +
geom_jitter(alpha = 0.05) +
ylim(0.98, 1.01) +
geom_smooth(method = "lm", color = "blue") +
geom_encircle(aes(x = alcohol, y = density),
data = wines[wines$quality >= 9, ],
color = "orangered")
grid.arrange(p1, p2, p3, p4, ncol = 2)
At first glance, no visible trend was revealed for the quality-alcohol pair, even after applying jitter and changing transparency to prevent overplotting. However, the plot of mean quality against alcohol shows quality increasing with alcohol content. The density-sugar pair exhibits a strong positive linear relationship (+0.8) and the density-alcohol pair a strong negative one (-0.8).
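The mean-quality line relies on rounding alcohol to the nearest 0.5 and averaging quality within each bin. A minimal base R sketch of that binning step, on made-up values rather than the wine data:

```r
# Made-up values for illustration only.
alcohol <- c(8.7, 8.8, 9.2, 9.3)
quality <- c(5, 6, 6, 7)

# Same expression as in the plot: round to the nearest multiple of 0.5.
alcohol_bin <- round(alcohol / 0.5) * 0.5

# Mean quality per alcohol bin.
mean_quality <- tapply(quality, alcohol_bin, mean)
print(mean_quality)
```

Averaging within bins removes the vertical scatter of the discrete quality scores, which is why the trend becomes visible in the line plot but not in the jittered scatterplot.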
p1 <- ggplot(data = wines, aes(x = chlorides, y = density)) +
geom_jitter(alpha = 0.05) +
xlim(0.01, 0.07) +
ylim(0.985, 1.005) +
geom_smooth(method = "lm", color = "blue")
p2 <- ggplot(data = wines, aes(x = total.sulfur.dioxide, y = density)) +
geom_jitter(alpha = 0.05) +
xlim(0, 300) +
ylim(0.985, 1.005) +
geom_smooth(method = "lm", color = "blue")
grid.arrange(p1, p2)
Chlorides and total sulphur dioxide are also positively related to density, as sugars are, but the correlation is weaker (+0.3 for chlorides).
ggplot(data = wines, aes(x = residual.sugar, y = alcohol)) +
geom_jitter(alpha = 0.1) +
xlim(0, 20) +
geom_smooth(method = "loess", color = "blue")
Residual sugar is negatively correlated with alcohol (-0.5), since sugar is the source of ethanol during winemaking: the more sugar is fermented, the less remains.
ggplot(data = wines, aes(x = pH, y = fixed.acidity)) +
geom_jitter(alpha = 0.1) +
scale_y_log10() +
geom_smooth(method = "lm", color = "blue")
Generally, the lower the pH, the higher the acidity in the wine, which was also confirmed by a moderate negative correlation (-0.4).
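The coefficients quoted in this section come from `cor()`, whose default method is Pearson's r. As a sketch of what that number measures, the hand-written version below is compared with the built-in on made-up pH and acidity values (the helper name `pearson_r` and the values are illustrative assumptions).

```r
# Pearson's r: covariance of x and y scaled by both standard deviations.
# `pearson_r` is a hypothetical helper; cor() is the built-in equivalent.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Made-up values, not taken from the wine data.
ph      <- c(3.0, 3.1, 3.2, 3.3, 3.4)
acidity <- c(7.9, 7.2, 7.0, 6.4, 6.1)

r_manual  <- pearson_r(ph, acidity)
r_builtin <- cor(ph, acidity)
print(c(manual = r_manual, builtin = r_builtin))
```

Pearson's r only captures linear association, so a weak coefficient does not rule out a non-linear relationship.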
ggplot(data = wines, aes(x = alcohol, y = total.sulfur.dioxide)) +
geom_jitter(alpha = 0.1) +
ylim(0, 300) +
geom_smooth(method = "lm", color = "blue")
Sulphur dioxide has some inhibitory effect on yeast during fermentation (the conversion of sugars to alcohol), which may explain why alcohol decreases as SO2 increases.
ggplot(data = wines, aes(x = alcohol, y = chlorides)) +
geom_jitter(alpha = 0.1) +
ylim(0, 0.1) +
geom_smooth(method = "lm", color = "blue")
Chloride salts, like SO2, have a similar retarding effect on yeast activity, which may explain the negative correlation (-0.4).
# Create a function for plotting.
# Note: `x` is unused; rating is always on the x axis.
f_plot2 <- function(x, y) {
  ggplot(wines, aes_string(x = "rating", y = y)) +
    geom_boxplot(alpha = 0.5, color = "blue") +
    geom_jitter(alpha = 0.02) +
    # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
    stat_summary(fun = mean,
                 geom = "point",
                 shape = 8,
                 size = 3,
                 color = "blue") +
    theme(legend.position = "none") +
    theme(axis.title.x = element_blank())
}
f_plot2(y = "total.acidity") + ylim(5, 10)
The highest total acidity is observed for undrinkable wines. Let’s see if there is any trend in the scatter plot.
ggplot(data = wines, aes(x = total.acidity, y = quality)) +
geom_jitter(alpha = 0.1) +
xlim(5, 10) +
geom_smooth(method = "lm")
A very small decrease of quality with an increase in acidity is revealed.
f_plot2(y = "chlorides") + ylim(0, 0.1)
Undrinkable and drinkable wines have the highest chlorides. It is known that increased chloride concentrations give an undesirable soapy taste in wine.
ggplot(data = wines, aes(x = chlorides, y = quality)) +
geom_jitter(alpha = 0.1) +
xlim(0, 0.1) +
geom_smooth(method = "lm")
The more the chlorides the lower the quality.
p1 <- f_plot2(y = "free.sulfur.dioxide") + ylim(0, 100)
p2 <- f_plot2(y = "total.sulfur.dioxide") + ylim(0, 300)
grid.arrange(p1, p2, ncol = 2)
Undrinkable wines stand out with the lowest free and total SO2. These wines have probably started to deteriorate, since a lack of SO2 favours unwanted microbial growth and oxidation.
ggplot(data = wines, aes(x = free.sulfur.dioxide, y = quality)) +
geom_jitter(alpha = 0.1) +
xlim(0, 100) +
geom_smooth(method = "lm")
Unfortunately, the scatterplot of the free SO2-quality pair above reveals no visual trend.
p1 <- f_plot2(y = "citric.acid") + ylim(0, 0.75)
p2 <- f_plot2(y = "residual.sugar") + ylim(0, 20)
p3 <- f_plot2(y = "density") + ylim(0.99, 1)
p4 <- f_plot2(y = "pH")
p5 <- f_plot2(y = "sulphates") + ylim(0.2, 1)
p6 <- f_plot2(y = "alcohol")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)
The unusual trend of the alcohol-quality pair (positive correlation, +0.4) can be clearly seen, and with it the negative density-quality relation (-0.3), since alcohol and density are competing factors (-0.8).
The main feature of interest, quality, seems to vary with alcohol, density, SO2, sugars and chlorides. Generally, a “bad” wine in this list has a higher acidity, chlorides and density, while its SO2 levels are lower.
It is worth noting the weak negative correlation of SO2-alcohol. A better understanding of the chemistry of SO2 can explain a competitive relation between SO2 and alcohol.
Alcohol is produced by the Saccharomyces yeast strains during alcoholic fermentation. Sulphur dioxide (SO2) is a natural by-product of winemaking as a small quantity is produced during the alcoholic fermentation by yeasts.
In practice, however, SO2 is added by winemakers either as a preservative or prior to alcoholic fermentation to control the growth of microorganisms, which stops the production of alcohol. This benefits the flavours of white wines, since the enzyme polyphenol oxidase is inhibited and less oxidative browning of the juice occurs. This helps to preserve the fruity and floral aromas found in the juice.
The strongest relationships were between density and sugars (+0.8) and between density and alcohol (-0.8).
Classify wines to dry and off-dry type based on the residual sugar content. Create a new data frame, that contains information on quality and type of wine.
# Create a type variable from residual.sugar.
wines$type <- NA
wines$type <- ifelse(wines$residual.sugar < 5,
"dry", "off_dry")
wines.alcohol_by_type <- wines %>%
group_by(quality, type) %>%
summarise(mean_alcohol = mean(alcohol),
median_alcohol = median(alcohol),
n = n()) %>%
arrange(quality)
ggplot(data = wines.alcohol_by_type,
aes(x = quality, y = mean_alcohol)) +
geom_line(aes(color = type)) +
scale_x_continuous(breaks = seq(3, 9, 1))
Quality increases with alcohol, and the trend is more pronounced for the dry wines, but more data are needed to draw robust conclusions.
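The grouped summary above uses dplyr. For readers without that package, a base R sketch of the same idea is shown below on a small made-up data frame (the values are illustrative and not taken from the wine data); it classifies wines with the same 5 g/L threshold and averages alcohol per quality and type.

```r
# Made-up data frame for illustration only.
wines_toy <- data.frame(
  quality        = c(5, 5, 6, 6, 6),
  residual.sugar = c(1.2, 8.0, 2.0, 3.0, 12.0),
  alcohol        = c(9.5, 9.0, 11.0, 12.0, 10.0)
)

# Same threshold as in the report: below 5 g/L is "dry", otherwise "off_dry".
wines_toy$type <- ifelse(wines_toy$residual.sugar < 5, "dry", "off_dry")

# Mean alcohol for each quality/type combination.
alcohol_by_type <- aggregate(alcohol ~ quality + type, data = wines_toy, FUN = mean)
print(alcohol_by_type)
```

Unlike the dplyr pipeline, this version returns only the mean; the group size could be obtained with a second `aggregate()` call using `FUN = length`.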
f_plot3 <- function(x, y) {
ggplot(data = wines,
aes_string(x = x, y = y, color = "rating")) +
geom_point(alpha = 0.5, size = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 1) +
scale_color_brewer(type='seq', guide = guide_legend(title = "Rating"))
}
p1 <- f_plot3(x = "residual.sugar", y = "density") +
xlim(0, 25) + ylim(0.985, 1.005)
p2 <- f_plot3(x = "alcohol", y = "density") +
ylim(0.985, 1.005)
p3 <- f_plot3(x = "chlorides", y = "density") +
ylim(0.985, 1.005) + xlim(0, 0.2)
p4 <- f_plot3(x = "total.sulfur.dioxide", y = "density") +
ylim(0.985, 1.005) + xlim(0, 300)
p5 <- f_plot3(x = "residual.sugar", y = "alcohol") +
xlim(0, 20)
p6 <- f_plot3(x = "pH", y = "fixed.acidity") +
ylim(4, 10)
p7 <- f_plot3(x = "alcohol", y = "chlorides") +
ylim(0, 0.15)
p8 <- f_plot3(x = "alcohol", y = "total.sulfur.dioxide") +
ylim(0, 300)
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)
The above plots depict a different visual behaviour (“the separation”) between lower and higher quality wines. For example, density rises more steeply with chloride content for “good” and “great” wines than for “average” wines.
A first attempt to build a linear model and use the variables in the linear model to predict the quality of a wine was not very successful, since the biggest R-squared value is only 0.282.
m1 <- lm(quality ~ alcohol, data = wines)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + total.sulfur.dioxide)
m6 <- update(m5, ~ . + fixed.acidity)
m7 <- update(m6, ~ . + residual.sugar)
m8 <- update(m7, ~ . + pH)
m9 <- update(m8, ~ . + sulphates)
m10 <- update(m9, ~ . + free.sulfur.dioxide)
m11 <- update(m10, ~ . + citric.acid)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, sdigits = 3)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + volatile.acidity,
## data = wines)
## m4: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides, data = wines)
## m5: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide, data = wines)
## m6: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity, data = wines)
## m7: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar,
## data = wines)
## m8: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar +
## pH, data = wines)
## m9: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar +
## pH + sulphates, data = wines)
## m10: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar +
## pH + sulphates + free.sulfur.dioxide, data = wines)
## m11: lm(formula = quality ~ alcohol + density + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar +
## pH + sulphates + free.sulfur.dioxide + citric.acid, data = wines)
##
## ===============================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -36.499*** -35.573*** -30.759*** -43.308*** 60.251*** 130.584*** 162.786*** 149.901*** 150.193***
## (0.098) (6.165) (6.001) (6.010) (6.295) (6.493) (14.109) (17.934) (18.569) (18.760) (18.804)
## alcohol 0.313*** 0.360*** 0.399*** 0.389*** 0.391*** 0.407*** 0.305*** 0.222*** 0.184*** 0.194*** 0.193***
## (0.009) (0.015) (0.014) (0.015) (0.015) (0.015) (0.019) (0.023) (0.024) (0.024) (0.024)
## density 24.728*** 38.992*** 38.217*** 33.251*** 46.423*** -57.411*** -130.265*** -162.939*** -149.987*** -150.284***
## (6.079) (5.920) (5.926) (6.234) (6.458) (14.123) (18.195) (18.839) (19.029) (19.075)
## volatile.acidity -2.072*** -2.043*** -2.070*** -2.108*** -2.094*** -2.021*** -1.966*** -1.868*** -1.863***
## (0.110) (0.111) (0.111) (0.111) (0.110) (0.110) (0.110) (0.112) (0.114)
## chlorides -1.300* -1.370* -1.383* -0.858 -0.267 -0.153 -0.234 -0.247
## (0.542) (0.543) (0.540) (0.540) (0.546) (0.544) (0.543) (0.547)
## total.sulfur.dioxide 0.001* 0.001* 0.001** 0.001** 0.001* -0.000 -0.000
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## fixed.acidity -0.099*** -0.045** 0.044* 0.066** 0.066** 0.066**
## (0.014) (0.015) (0.021) (0.021) (0.021) (0.021)
## residual.sugar 0.045*** 0.075*** 0.087*** 0.081*** 0.081***
## (0.005) (0.007) (0.007) (0.008) (0.008)
## pH 0.665*** 0.707*** 0.684*** 0.686***
## (0.105) (0.105) (0.105) (0.105)
## sulphates 0.638*** 0.632*** 0.631***
## (0.100) (0.100) (0.100)
## free.sulfur.dioxide 0.004*** 0.004***
## (0.001) (0.001)
## citric.acid 0.022
## (0.096)
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.247 0.248 0.249 0.257 0.267 0.273 0.279 0.282 0.282
## adj. R-squared 0.190 0.192 0.246 0.247 0.248 0.256 0.266 0.272 0.278 0.280 0.280
## sigma 0.797 0.796 0.769 0.768 0.768 0.764 0.759 0.756 0.753 0.751 0.751
## F 1146.395 583.290 534.843 402.956 324.034 281.812 254.596 229.523 210.136 191.810 174.344
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5660.164 -5657.292 -5654.027 -5627.454 -5593.583 -5573.700 -5553.598 -5543.767 -5543.740
## Deviance 3112.257 3101.773 2892.625 2889.234 2885.385 2854.246 2815.042 2792.280 2769.454 2758.359 2758.329
## AIC 11684.782 11670.255 11330.329 11326.584 11322.054 11270.908 11205.165 11167.399 11129.197 11111.534 11113.480
## BIC 11704.272 11696.241 11362.812 11365.563 11367.530 11322.880 11263.634 11232.365 11200.659 11189.493 11197.936
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ===============================================================================================================================================================
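The fit statistics in the table can be cross-checked from its own entries. The sketch below recomputes adjusted R-squared, AIC and BIC from the reported R-squared and log-likelihood values, assuming the usual definitions for a Gaussian linear model (the parameter count is the number of coefficients plus one for the residual variance). The helper name `info_criteria` is hypothetical.

```r
n <- 4898  # number of observations (row N in the table)

# Adjusted R-squared for m11: 11 predictors, R-squared = 0.282.
r2_m11 <- 0.282
p_m11  <- 11
adj_r2_m11 <- 1 - (1 - r2_m11) * (n - 1) / (n - p_m11 - 1)

# AIC and BIC from the log-likelihood.
# `n_coef` counts the intercept; one extra parameter is the residual variance.
# `info_criteria` is a hypothetical helper for illustration only.
info_criteria <- function(loglik, n_coef, n) {
  k <- n_coef + 1
  c(AIC = -2 * loglik + 2 * k,
    BIC = -2 * loglik + log(n) * k)
}

ic_m1  <- info_criteria(-5839.391, n_coef = 2,  n = n)
ic_m10 <- info_criteria(-5543.767, n_coef = 11, n = n)
ic_m11 <- info_criteria(-5543.740, n_coef = 12, n = n)

print(round(adj_r2_m11, 3))
print(round(rbind(m1 = ic_m1, m10 = ic_m10, m11 = ic_m11), 3))
```

Both AIC and BIC are higher for m11 than for m10, which indicates that adding citric acid does not improve the model enough to justify the extra parameter.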
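The second attempt was to build a decision tree model, since it is more robust, handles non-linearity and performs well with both numerical and categorical data. Two versions of the wine dataset were examined: one with the original numerical variables, and one in which free SO2, chlorides, volatile acidity and citric acid were recoded as high/low categorical variables.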
fit <- rpart(rating ~ alcohol + density + volatile.acidity + chlorides +
total.sulfur.dioxide + fixed.acidity + residual.sugar +
pH + sulphates + free.sulfur.dioxide + citric.acid,
data = wines,
method="class")
rpart.plot(fit)
# Create new categorical variables with high-low levels.
wines$free.sulfur.dioxide_hilo <- NA
wines$free.sulfur.dioxide_hilo <- ifelse(wines$free.sulfur.dioxide < 50, 0, 1)
wines$free.sulfur.dioxide_hilo <- factor(wines$free.sulfur.dioxide_hilo)
wines$chlorides_hilo <- NA
wines$chlorides_hilo <- ifelse(wines$chlorides < 0.06, 0, 1)
wines$chlorides_hilo <- factor(wines$chlorides_hilo)
wines$volatile.acidity_hilo <- NA
wines$volatile.acidity_hilo <- ifelse(wines$volatile.acidity < 0.26, 0, 1)
wines$volatile.acidity_hilo <- factor(wines$volatile.acidity_hilo)
wines$citric.acid_hilo <- NA
wines$citric.acid_hilo <- ifelse(wines$citric.acid > 0, 1, 0)
wines$citric.acid_hilo <- factor(wines$citric.acid_hilo)
fit <- rpart(rating ~ alcohol +
density +
volatile.acidity_hilo +
chlorides_hilo +
free.sulfur.dioxide_hilo +
fixed.acidity +
residual.sugar +
pH +
sulphates +
total.sulfur.dioxide +
citric.acid_hilo,
data = wines,
method="class")
rpart.plot(fit)
A few of the variables seem to play a significant role in quality, such as alcohol, volatile acidity, free SO2 and sugars.
Several interesting relations were found and most of them were explained by literature reviewing. Some of the most important correlation pairs are:
The quite strong positive relationship between quality and alcohol was really surprising, since ethanol is not considered tasty.
Two models, a linear model and a decision tree, were created to predict the quality of a wine, but the results were unsatisfactory. Additional observations or variables are probably needed to build a good prediction model.
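The report does not give an accuracy figure for the trees. One way to make "unsatisfactory" concrete is to compare a classifier against the majority-class baseline, which follows directly from the rating counts in the summary table. The sketch below computes that baseline and defines a hypothetical `accuracy` helper based on a confusion matrix; the predicted labels in the example are made up.

```r
# Rating counts taken from the summary table.
rating_counts <- c(Undrinkable = 183, Drinkable = 1457, Average = 2198,
                   Good = 880, Great = 180)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy <- max(rating_counts) / sum(rating_counts)
print(round(baseline_accuracy, 3))

# Share of correct predictions, read off the diagonal of a confusion matrix.
# `accuracy` is a hypothetical helper for illustration only.
accuracy <- function(actual, predicted) {
  lv <- sort(unique(c(as.character(actual), as.character(predicted))))
  confusion <- table(factor(actual, levels = lv), factor(predicted, levels = lv))
  sum(diag(confusion)) / sum(confusion)
}

# Made-up labels, not model output.
actual    <- c("Average", "Average", "Good", "Drinkable")
predicted <- c("Average", "Good",    "Good", "Average")
print(accuracy(actual, predicted))
```

For a fitted tree, `predicted` could come from `predict(fit, type = "class")`, ideally on observations held out from training; a useful model should clearly exceed the baseline of about 45%.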
p1 <- ggplot(data = wines, aes(x = residual.sugar)) +
geom_histogram(binwidth = 0.06, colour = "white",
alpha = 0.8, aes(y = ..density.., fill = ..count..)) +
scale_fill_gradient("Count", low = "white", high = "lightblue") +
scale_x_log10() +
geom_density(colour = "lightblue") +
theme_classic() +
labs(title = "Histogram and Density Plot of Residual Sugar in White Wines",
x = "Residual Sugar in g/L",
y = "Density")
ggplotly(p1)
The distribution of the residual sugar appears to be bimodal on log scale, because there are two types of wine, the dry and the off-dry ones.
p2 <- plot_ly(data = wines[wines$density < 1.002, ],
x = ~density) %>%
add_markers(y = ~alcohol,
name = 'Alcohol',
marker = list(color = 'rgba(88, 116, 152, 0.9)'),
hoverinfo = "text",
text = ~paste(alcohol)) %>%
add_lines(y = ~fitted(lm(alcohol ~ density)),
line = list(color = '#E86850'),
name = "Linear smoother",
showlegend = FALSE) %>%
add_markers(y = ~residual.sugar,
name = 'Residual Sugar',
yaxis = 'y2',
hoverinfo = "text",
alpha = 0.5,
text = ~paste(residual.sugar, 'g/L')) %>%
add_lines(y = ~fitted(loess(residual.sugar ~ density)),
line = list(color = '#FFD800'),
name = "Loess Smoother",
showlegend = TRUE) %>%
layout(title = "How Alcohol and Residual Sugar Relate with Density",
xaxis = list(title = "Density"),
yaxis = list(side = 'left',
title = "Percentage of Alcohol Content",
showgrid = FALSE,
zeroline = FALSE),
yaxis2 = list(side = 'right',
overlaying = "y",
title = "Residual Sugar in g/L",
showgrid = FALSE,
zeroline = FALSE))
ggplotly(p2)
Alcohol tends to decrease the density of wine, in contrast to sugars, which increase it. There appears to be a linear relationship between alcohol and density, with an R-squared of 61%.
summary(lm(I(density) ~ I(alcohol), data = wines))
##
## Call:
## lm(formula = I(density) ~ I(alcohol), data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005475 -0.001238 -0.000153 0.001156 0.047201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.014e+00 2.300e-04 4407.87 <2e-16 ***
## I(alcohol) -1.896e-03 2.173e-05 -87.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001871 on 4896 degrees of freedom
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6085
## F-statistic: 7613 on 1 and 4896 DF, p-value: < 2.2e-16
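The quantities in this output are tied together for a regression with a single predictor: the t value is the estimate divided by its standard error, the F statistic is the square of that t value, and R-squared is the square of the Pearson correlation. A short sketch that recomputes them from the printed (rounded) numbers:

```r
# Values copied from the regression output above.
estimate  <- -1.896e-03   # slope for alcohol
std_error <-  2.173e-05   # its standard error
r_squared <-  0.6086      # multiple R-squared

# t statistic of the slope.
t_value <- estimate / std_error

# With one predictor, F equals t squared.
f_value <- t_value^2

# With one predictor, the correlation is the signed square root of R-squared.
r <- sign(estimate) * sqrt(r_squared)

print(c(t = t_value, F = f_value, r = r))
```

The recovered correlation of about -0.78 agrees with the -0.8 shown for the density-alcohol pair in the correlation plot.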
wines_good <- subset(wines, wines$free.sulfur.dioxide < 50 &
wines$volatile.acidity < 0.26 &
wines$citric.acid > 0 &
wines$chlorides < 0.06 &
wines$sulphates < 0.47)
p3 <- ggplot(data = wines_good,
aes(x = alcohol, y = density, color = rating)) +
geom_point(size = 2, alpha = 0.9) +
geom_smooth(method = "lm", se = FALSE, size = 0.8) +
scale_color_brewer(type='seq', guide = guide_legend(title = "Rating")) +
scale_x_continuous(limits = c(8.5, 14),
breaks = seq(8.5, 14, 0.5)) +
scale_y_continuous(limits = c(0.985, 1.001),
breaks = seq(0.985, 1.001, 0.002)) +
ggtitle("Alcohol vs. Density
with low SO2, Chlorides, Volatile Acidity
and high Citric Acid Aromas") +
labs(x = "Percentage of Alcohol Content",
y = "Density (g/mL)") +
theme_classic()
ggplotly(p3)
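This plot shows again how density varies with alcohol content, but this time only for wines selected with the following criteria: free SO2 below 50, volatile acidity below 0.26, citric acid above 0, chlorides below 0.06 and sulphates below 0.47.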
This was an attempt to examine wines with desirable physicochemical properties, but unfortunately this bucket still contains all qualities. Nevertheless, density decreases with alcohol.
The white wines data set contains information on almost 5000 wines across 12 variables. Several interesting trends and relations were observed during the data exploration. There was a clear bimodal distribution of residual sugar, which indicates two types of wines, dry and off-dry. The quality ranged from 5 to 7 for most of the wines. There were strong correlations for density-sugars and density-alcohol, as expected, and an unusual positive relation between quality and alcohol. Several other relations were also investigated.
Finally, a linear model and a decision tree were applied in order to build a prediction model of quality using all the variables of the data set. It was found once again that alcohol has the highest influence on wine quality, but the remaining factors were less significant than desired. Either this data set is too small or other, more crucial variables need to be investigated.
It is really difficult to tightly define with only 12 physicochemical properties the complexity of a wine, which can come from different things such as: